215 research outputs found
Deep Unrestricted Document Image Rectification
In recent years, tremendous efforts have been made on document image
rectification, but existing advanced algorithms are limited to processing
restricted document images, i.e., the input images must incorporate a complete
document. When a captured image contains only a local text region, the
rectification quality degrades and becomes unsatisfactory. Our previously proposed
DocTr, a transformer-assisted network for document image rectification, also
suffers from this limitation. In this work, we present DocTr++, a novel unified
framework for document image rectification, without any restrictions on the
input distorted images. Our major technical improvements can be summarized in
three aspects. Firstly, we upgrade the original architecture by adopting a
hierarchical encoder-decoder structure for multi-scale representation
extraction and parsing. Secondly, we reformulate the pixel-wise mapping
relationship between unrestricted distorted document images and their
distortion-free counterparts, and use the resulting data to train our DocTr++
for unrestricted document image rectification. Thirdly, we contribute a
real-world test set and metrics applicable for evaluating the rectification
quality. To the best of our knowledge, this is the first learning-based method for the
rectification of unrestricted document images. Extensive experiments are
conducted, and the results demonstrate the effectiveness and superiority of our
method. We hope our DocTr++ will serve as a strong baseline for generic
document image rectification, promoting the further advancement and application
of learning-based algorithms. The source code and the proposed dataset are
publicly available at https://github.com/fh2019ustc/DocTr-Plus
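To make the first improvement concrete, below is a minimal PyTorch sketch (not the authors' released code) of a hierarchical encoder-decoder that extracts multi-scale features and regresses a pixel-wise backward map used to resample the distorted image; all layer sizes, module names, and the two-channel map convention are illustrative assumptions.

```python
# Minimal sketch: hierarchical encoder-decoder regressing a 2-channel
# pixel-wise backward map from a distorted document image. All sizes and
# names are illustrative assumptions, not the DocTr++ implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalRectifier(nn.Module):
    def __init__(self, chs=(32, 64, 128)):
        super().__init__()
        # Encoder: each stage halves resolution for multi-scale extraction.
        self.enc1 = nn.Sequential(nn.Conv2d(3, chs[0], 3, 2, 1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(chs[0], chs[1], 3, 2, 1), nn.ReLU())
        self.enc3 = nn.Sequential(nn.Conv2d(chs[1], chs[2], 3, 2, 1), nn.ReLU())
        # Decoder: upsample and fuse skip connections to parse the scales.
        self.dec2 = nn.Sequential(nn.Conv2d(chs[2] + chs[1], chs[1], 3, 1, 1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.Conv2d(chs[1] + chs[0], chs[0], 3, 1, 1), nn.ReLU())
        # Head: 2 channels = (x, y) sampling coordinates in [-1, 1].
        self.head = nn.Conv2d(chs[0], 2, 3, 1, 1)

    def forward(self, img):
        f1 = self.enc1(img)
        f2 = self.enc2(f1)
        f3 = self.enc3(f2)
        u2 = F.interpolate(f3, scale_factor=2, mode="bilinear", align_corners=False)
        d2 = self.dec2(torch.cat([u2, f2], dim=1))
        u1 = F.interpolate(d2, scale_factor=2, mode="bilinear", align_corners=False)
        d1 = self.dec1(torch.cat([u1, f1], dim=1))
        bmap = torch.tanh(self.head(d1))
        bmap = F.interpolate(bmap, size=img.shape[-2:], mode="bilinear",
                             align_corners=False)
        # Resample the distorted input with the predicted backward map.
        grid = bmap.permute(0, 2, 3, 1)  # (B, H, W, 2) for grid_sample
        return F.grid_sample(img, grid, align_corners=False)
```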
DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction
In this work, we propose a new framework, called Document Image Transformer
(DocTr), to address geometric and illumination distortion in document
images. Specifically, DocTr consists of a geometric unwarping
transformer and an illumination correction transformer. Leveraging a set of
learned query embeddings, the geometric unwarping transformer captures the
global context of the document image via the self-attention mechanism and
decodes a pixel-wise displacement field to correct the geometric distortion. After
geometric unwarping, our illumination correction transformer further removes
the shading artifacts to improve the visual quality and OCR accuracy. Extensive
evaluations are conducted on several datasets, and superior results are
reported against the state-of-the-art methods. Remarkably, our DocTr achieves
20.02% Character Error Rate (CER), a 15% absolute improvement over the
state-of-the-art methods. Moreover, it also shows high efficiency in running
time and parameter count. The results will be available at
https://github.com/fh2019ustc/DocTr for further comparison.
Comment: This paper has been accepted by ACM Multimedia 2021
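As a rough illustration of the query-based decoding described above, the sketch below shows learned query embeddings cross-attending over encoded image tokens and being regressed to a coarse displacement field. Dimensions, layer counts, and names are assumptions, not the released DocTr implementation.

```python
# Hedged sketch: learned queries decode a pixel-wise displacement field by
# cross-attending over self-attention image tokens. Sizes are assumptions.
import torch
import torch.nn as nn

class UnwarpDecoder(nn.Module):
    def __init__(self, dim=128, heads=8, grid=36):
        super().__init__()
        self.grid = grid
        # One learned query per output cell of the displacement map.
        self.queries = nn.Parameter(torch.randn(grid * grid, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.to_flow = nn.Linear(dim, 2)  # (dx, dy) per grid cell

    def forward(self, memory):
        # memory: (B, N, dim) tokens from a self-attention image encoder.
        q = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        out = self.decoder(q, memory)                  # cross-attention
        flow = self.to_flow(out)                       # (B, grid*grid, 2)
        return flow.view(-1, self.grid, self.grid, 2)  # coarse displacement
```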
DocScanner: Robust Document Image Rectification with Progressive Learning
Compared with flatbed scanners, portable smartphones are much more convenient
for digitizing physical documents. However, the digitized documents are often
distorted due to uncontrolled physical deformations, camera positions, and
illumination variations. To this end, we present DocScanner, a novel framework
for document image rectification. Different from existing methods, DocScanner
addresses this issue by introducing a progressive learning mechanism.
Specifically, DocScanner maintains a single estimate of the rectified image,
which is progressively corrected with a recurrent architecture. The iterative
refinements make DocScanner converge to a robust and superior performance,
while the lightweight recurrent architecture ensures running efficiency. In
addition, before this rectification process, motivated by the corrupted
rectified boundaries observed in prior works, DocScanner employs a document
localization module to explicitly segment the foreground document from the
cluttered background. Furthermore, based on the geometric prior between the
distorted and the rectified images, a geometric regularization is introduced
during training to further improve rectification quality. Extensive
experiments are conducted on the Doc3D
dataset and the DocUNet Benchmark dataset, and the quantitative and qualitative
evaluation results verify the effectiveness of DocScanner, which outperforms
previous methods on OCR accuracy, image similarity, and our proposed distortion
metric by a considerable margin. Furthermore, our DocScanner shows the highest
efficiency in runtime latency and model size.
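The progressive learning mechanism can be pictured with a small recurrent update cell that repeatedly corrects a single maintained flow estimate, as sketched below. This is a hedged illustration under assumed module sizes and a ConvGRU-style cell, not DocScanner's actual architecture.

```python
# Illustrative sketch of progressive refinement: a lightweight recurrent
# cell iteratively corrects one flow estimate. All sizes are assumptions.
import torch
import torch.nn as nn

class ProgressiveRectifier(nn.Module):
    def __init__(self, feat=64, hidden=64, iters=8):
        super().__init__()
        self.iters = iters
        self.encoder = nn.Sequential(nn.Conv2d(3, feat, 7, 4, 3), nn.ReLU())
        # ConvGRU-style gates keep the recurrent update lightweight.
        self.gz = nn.Conv2d(hidden + feat + 2, hidden, 3, 1, 1)
        self.gr = nn.Conv2d(hidden + feat + 2, hidden, 3, 1, 1)
        self.gq = nn.Conv2d(hidden + feat + 2, hidden, 3, 1, 1)
        self.to_delta = nn.Conv2d(hidden, 2, 3, 1, 1)  # residual flow update

    def forward(self, img):
        feat = self.encoder(img)
        b, _, h, w = feat.shape
        flow = feat.new_zeros(b, 2, h, w)  # single maintained estimate
        hid = feat.new_zeros(b, self.gz.out_channels, h, w)
        for _ in range(self.iters):        # progressive correction
            x = torch.cat([hid, feat, flow], dim=1)
            z = torch.sigmoid(self.gz(x))
            r = torch.sigmoid(self.gr(x))
            q = torch.tanh(self.gq(torch.cat([r * hid, feat, flow], dim=1)))
            hid = (1 - z) * hid + z * q
            flow = flow + self.to_delta(hid)  # refine, don't re-predict
        return flow
```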
Masked Motion Predictors are Strong 3D Action Representation Learners
In 3D human action recognition, limited supervised data makes it challenging
to fully tap into the modeling potential of powerful networks such as
transformers. As a result, researchers have been actively investigating
effective self-supervised pre-training strategies. In this work, we show that
instead of following the prevalent pretext task of masked self-component
reconstruction of human joints, explicit contextual motion modeling is key to
learning effective feature representations for
3D action recognition. Formally, we propose the Masked Motion Prediction (MAMP)
framework. To be specific, the proposed MAMP takes as input the masked
spatio-temporal skeleton sequence and predicts the corresponding temporal
motion of the masked human joints. Considering the high temporal redundancy of
the skeleton sequence, in our MAMP, the motion information also acts as an
empirical semantic-richness prior that guides the masking process, promoting
better attention to semantically rich temporal regions. Extensive experiments
on NTU-60, NTU-120, and PKU-MMD datasets show that the proposed MAMP
pre-training substantially improves the performance of the adopted vanilla
transformer, achieving state-of-the-art results without bells and whistles. The
source code of our MAMP is available at https://github.com/maoyunyao/MAMP.
Comment: To appear in ICCV 2023
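The masked motion prediction pretext task can be sketched as follows, under assumed tensor shapes (B, T, J, C): motion is the temporal difference of joint coordinates, its magnitude biases which frames are masked, and the loss is computed only at masked positions. The `encoder` argument is a hypothetical stand-in for the vanilla transformer backbone, and masked frames are simply zeroed here for brevity; this is not the released MAMP code.

```python
# Hedged sketch of masked motion prediction. Shapes and the masking rule
# are assumptions; `encoder` is a hypothetical transformer backbone.
import torch
import torch.nn.functional as F

def mamp_step(encoder, seq, mask_ratio=0.9):
    # seq: (B, T, J, C) skeleton sequence; motion = x_{t+1} - x_t.
    motion = seq[:, 1:] - seq[:, :-1]           # (B, T-1, J, C)
    motion = F.pad(motion, (0, 0, 0, 0, 0, 1))  # pad last frame -> (B, T, J, C)
    # Motion magnitude as a semantic-richness prior over time steps.
    score = motion.abs().mean(dim=(2, 3))       # (B, T)
    probs = torch.softmax(score, dim=1)
    num_mask = int(mask_ratio * seq.size(1))
    idx = torch.multinomial(probs, num_mask)    # motion-guided picks
    mask = torch.zeros_like(score, dtype=torch.bool)
    mask.scatter_(1, idx, True)                 # (B, T) masked frames
    # Masked frames are zeroed for simplicity; an MAE-style encoder would
    # instead drop them. The network predicts motion at every position.
    pred = encoder(seq * (~mask)[:, :, None, None])  # (B, T, J, C)
    loss = F.mse_loss(pred[mask], motion[mask])      # masked targets only
    return loss
```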
SimFIR: A Simple Framework for Fisheye Image Rectification with Self-supervised Representation Learning
In fisheye images, distinct distortion patterns are regularly distributed
across the image plane. These distortion patterns are independent of
the visual content and provide informative cues for rectification. To make the
best of such rectification cues, we introduce SimFIR, a simple framework for
fisheye image rectification based on self-supervised representation learning.
Technically, we first split a fisheye image into multiple patches and extract
their representations with a Vision Transformer (ViT). To learn fine-grained
distortion representations, we then associate different image patches with
their specific distortion patterns based on the fisheye model, and design a
unified distortion-aware pretext task for learning them. The transfer
performance on the downstream rectification task is
remarkably boosted, which verifies the effectiveness of the learned
representations. Extensive experiments are conducted, and the quantitative and
qualitative results demonstrate the superiority of our method over the
state-of-the-art algorithms as well as its strong generalization ability on
real-world fisheye images.
Comment: Accepted to ICCV 2023
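One way to picture the distortion-aware pretext task is the sketch below: patches are labeled by their radial distance from the image center, a proxy for the distortion level prescribed by the fisheye model, and a ViT-style encoder is trained to classify each patch's distortion pattern. The binning rule and all names are illustrative assumptions, not SimFIR's actual objective.

```python
# Hedged sketch: per-patch distortion-pattern classification as a pretext
# task. Radial binning and all module sizes are assumptions.
import torch
import torch.nn as nn

def radial_labels(img_size=224, patch=16, num_bins=8):
    # Label each patch by how far its center lies from the image center.
    n = img_size // patch
    ys, xs = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
    cy = cx = (n - 1) / 2
    r = torch.sqrt((ys - cy) ** 2 + (xs - cx) ** 2).flatten()
    return (r / r.max() * (num_bins - 1)).round().long()  # (n*n,)

class PatchDistortionClassifier(nn.Module):
    def __init__(self, patch=16, dim=192, num_bins=8):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, patch, patch)  # patchify + project
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_bins)          # per-patch pattern

    def forward(self, img):
        tok = self.embed(img).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.head(self.encoder(tok))              # (B, N, num_bins)

# Pretext training pairs each patch prediction with its radial-bin label,
# e.g. F.cross_entropy(logits.flatten(0, 1), labels.repeat(batch_size)).
```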